EFFICIENT STRING MATCHING WITH k MISMATCHES


Gad M. Landau
Uzi Vishkin


ABSTRACT 

Given a text of length n, a pattern of length m and an integer
k, we present an algorithm for finding all occurrences of the pattern
in the text, each with at most k mismatches. The algorithm
runs in O(k(m log m + n)) time.

I. INTRODUCTION

The problem of string matching with k mismatches is defined as follows.
Suppose we are given a text of length n, a pattern of length m and an integer k.
Find all occurrences of the pattern in the text with at most k locations in which
the text and the pattern have different symbols. Note that the case k = 0 is the
extensively studied string matching problem. Let us mention a few notable
algorithms for the string matching problem: linear time serial algorithms - [BM],
[GS], [KMP], [KR] (a randomized algorithm) and [Y], parallel algorithms - [G] and
[V]. The problem has a strong pragmatic flavor. In practice, we often need to
analyze situations where the data is not completely reliable. Specifically,
consider a situation where the strings which are the input for our problem contain
errors and we still need to find all possible occurrences of the pattern in the
text as in reality. Assuming some bound on the number of errors clearly
leads to our problem.
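
For orientation only, the problem statement can be made concrete by a naive
O(nm) baseline. The sketch below is not the algorithm of this paper; the
function name naive_k_mismatch is hypothetical, and start positions are
reported 1-based (a hit at i+1 means the pattern is laid at t_{i+1}).

```python
def naive_k_mismatch(text, pattern, k):
    """Return the 1-based start positions whose window has at most k mismatches.

    Naive baseline: every window is compared symbol by symbol.
    """
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):          # window t_{i+1}, ..., t_{i+m}
        mismatches = 0
        for f in range(m):
            if text[i + f] != pattern[f]:
                mismatches += 1
                if mismatches > k:      # more than k mismatches: give up early
                    break
        if mismatches <= k:
            hits.append(i + 1)
    return hits


print(naive_k_mismatch("abcabd", "abd", 1))
```

Such a baseline is mainly useful as a reference against which a faster
implementation can be checked.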

We present an algorithm for string matching with k mismatches which runs in
time O(k(m log m + n)) on a random-access-machine (RAM) [AHU].

After all the results in the present paper had been achieved, A. Slisenko
brought to our attention the paper [I] in which another algorithm for the
same problem is given. Ivanov claims that his algorithm runs in time
O(f(k)(n+m)), where f(k) is a function of k. f(k) is described by a combination
of two intricate recursive inequalities. No additional hints regarding the
behaviour of f(k) were found in his paper. We were unable to solve these
inequalities. However, we managed to show that f(k) is bounded from below by 2^k for
every positive integer k. It might be that f(k) grows even substantially faster
than 2^k. His algorithm runs faster than ours only when k is very small and m
and n are almost of the same order of magnitude. In all other cases, our
algorithm is faster. An even more important advantage of our algorithm is that it is
simple and intuitive while Ivanov's algorithm is very complicated. (Its
description needed over 40 journal pages.)

II.  ANALYSIS  OF  THE  TEXT 

Our algorithm has two parts. In the first part the pattern is analyzed. The
outcome of this analysis is used in the second part for analyzing the text. The
next section describes the first part. The present section is devoted to the
second part. We show how to use the results of the pattern analysis in order to
find all occurrences of the pattern in the text with at most k mismatches.

The  input  to  the  text  analysis  consists  of  the  following: 

a) The pattern. An array A = a_1, ..., a_m.

b) The text. An array T = t_1, ..., t_n.

c) The output of the pattern analysis. A two dimensional array
PAT-MISMATCH[1,...,m-1; 1,...,2k+1], where row i of the array
(PAT-MISMATCH(i,1), ..., PAT-MISMATCH(i,2k+1)) contains the 2k+1
first locations in which a_{i+1}, ..., a_m has different symbols than
a_1, ..., a_{m-i}. (PAT-MISMATCH(i,v) = f means that a_{i+f} ≠ a_f and f is
the mismatch number v from left to right.)

If there are only c < 2k+1 mismatches between a_{i+1}, ..., a_m and
a_1, ..., a_{m-i}, we enter the default value m+1 from location c+1 on. That is,
PAT-MISMATCH(i,c+1) = ... = PAT-MISMATCH(i,2k+1) = m+1.
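
To make the layout of PAT-MISMATCH concrete, the following sketch fills the
table by direct comparison. It is a brute-force illustration of the definition
only, not the pattern analysis of the paper; the function name
pat_mismatch_table and the use of a dictionary keyed by the row number i are
assumptions made for the example.

```python
def pat_mismatch_table(pattern, k):
    """Brute-force PAT-MISMATCH: maps i (1..m-1) to a row of 2k+1 entries.

    Entry v of row i is the v-th location f with a_{i+f} != a_f, or the
    default value m+1 when fewer than v such locations exist.
    """
    m = len(pattern)
    width = 2 * k + 1
    table = {}
    for i in range(1, m):
        row = []
        for f in range(1, m - i + 1):
            # pattern is 0-based in Python, a_f is pattern[f - 1]
            if pattern[i + f - 1] != pattern[f - 1]:
                row.append(f)
                if len(row) == width:
                    break
        row += [m + 1] * (width - len(row))   # pad with the default m+1
        table[i] = row
    return table


print(pat_mismatch_table("abab", 1))
```

For the pattern abab and k = 1, row 2 holds only default values because the
pattern shifted by two places agrees with itself everywhere.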

The text is analyzed into the array TEXT-MISMATCH[0,...,n-m; 1,...,k+1].

Following the text analysis, row i of the array (TEXT-MISMATCH(i,1), ...,
TEXT-MISMATCH(i,k+1)) contains the k+1 first mismatches between the
strings t_{i+1}, ..., t_{i+m} and a_1, ..., a_m. (TEXT-MISMATCH(i,v) = f means that
t_{i+f} ≠ a_f and this is mismatch number v from left to right.) If there are only
c < k+1 mismatches between t_{i+1}, ..., t_{i+m} and a_1, ..., a_m then we enter the
default value m+1 from location c+1 on. That is,
TEXT-MISMATCH(i,c+1) = ... = TEXT-MISMATCH(i,k+1) = m+1.

Remark. This solves our problem since TEXT-MISMATCH(i,k+1) = m+1
means that there is an occurrence of the pattern which starts at t_{i+1} with at
most k mismatches.
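
The Remark can be illustrated by computing one row of TEXT-MISMATCH directly
from its definition and reading the answer off its last entry. This is a
brute-force sketch of the definition, not the text analysis described below;
the names text_mismatch_row and occurrences_from_table are hypothetical.

```python
def text_mismatch_row(text, pattern, k, i):
    """Row i of TEXT-MISMATCH by direct comparison (k+1 entries, default m+1)."""
    m = len(pattern)
    row = []
    for f in range(1, m + 1):
        if text[i + f - 1] != pattern[f - 1]:   # t_{i+f} != a_f
            row.append(f)
            if len(row) == k + 1:
                break
    return row + [m + 1] * (k + 1 - len(row))


def occurrences_from_table(text, pattern, k):
    """1-based starts i+1 for which TEXT-MISMATCH(i, k+1) = m+1."""
    n, m = len(text), len(pattern)
    return [i + 1 for i in range(n - m + 1)
            if text_mismatch_row(text, pattern, k, i)[k] == m + 1]


print(occurrences_from_table("abcabd", "abd", 1))
```

The entry at Python index k is TEXT-MISMATCH(i,k+1); it equals m+1 exactly
when the window has at most k mismatches.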

We  start  with  a  very  high-level  specification  of  the  algorithm.  It  is  explained 
by  the  verbal  and  illustrative  descriptions  that  follow. 

TEXT-ANALYSIS 

Initialize: TEXT-MISMATCH[0,...,n-m; 1,...,k+1] := m+1;
            r := 0; j := 0;
for i := 0 to n-m do
begin
    b := 0;
    if i < j
        then MERGE(i,r,j,b);
    if b < k+1
        then r := i; EXTEND(i,j,b)
end

The for loop is responsible for "sliding" the pattern to the right one place
at a time. At iteration i, we check if an occurrence of the pattern starts at t_{i+1}.
Suppose that r is an iteration prior to i (0 ≤ r < i) that maximizes
j = r + TEXT-MISMATCH(r,k+1). Namely, j is the rightmost index of the text
to which we arrived at previous iterations of the loop. Each iteration consists of
calling procedure MERGE (if i < j) and possibly procedure EXTEND. (Note that
at the beginning i = 0, j = 0, and therefore MERGE is not invoked at the first
iteration.) MERGE finds mismatches between t_{i+1}, ..., t_j and a_1, ..., a_{j-i} and
reports in b the number of mismatches found. If b = k+1 we proceed to the
next iteration. Otherwise, EXTEND scans the text from t_{j+1} on till it either finds
k+1 mismatches or till it hits t_{i+m} and finds that there is an occurrence of the
pattern which starts at t_{i+1} with at most k mismatches. The situation is
illustrated in Fig. 1.




Let us explain the role that procedure MERGE plays at iteration i of
TEXT-ANALYSIS. In the previous paragraph we stated that MERGE finds
mismatches between t_{i+1}, ..., t_j and a_1, ..., a_{j-i} and reports in b the number
of mismatches found. That is, MERGE computes TEXT-MISMATCH[i; 1,...,b]
(b ≤ k+1). MERGE uses two kinds of data that were computed in iterations
prior to i of TEXT-ANALYSIS.

(a) The mismatches with respect to (in short, w.r.t.) r+1 in the text.
Obviously, such mismatches which occur in locations < i+1 in the text are
irrelevant for checking whether there is an occurrence of the pattern that
starts at t_{i+1}. Let q be the smallest integer satisfying
TEXT-MISMATCH(r,q) is greater than i-r. Thus, MERGE uses
TEXT-MISMATCH[r; q,...,k+1]. (Fig. 1(b).)

(b) PAT-MISMATCH[i-r; 1,...,s], where s is the rightmost mismatch in
PAT-MISMATCH[i-r; 1,...,2k+1] such that PAT-MISMATCH(i-r,s) is less
than (j-i+1). (Fig. 1(c).)

We apply a case analysis in order to understand how to use these previously
computed data. We need the following two conditions for the case analysis.
Consider any location x of the text, i+1 ≤ x ≤ j. We define two conditions on x.

Condition 1. x falls under a mismatch w.r.t. r. That is, t_x ≠ a_{x-r} and for some
d (q ≤ d ≤ k+1), x - r = TEXT-MISMATCH(r,d). (This corresponds to a mismatch
between two locations, one from the bottom line and the other from the middle
line in Fig. 1(d).)

Consider laying one copy of the pattern starting at t_{i+1} and another copy
starting at t_{r+1}. (The upper and middle lines in Fig. 1(d).)

Condition 2. x falls under a mismatch between these two copies of the pattern.
That is, a_{x-i} ≠ a_{x-r}. Also, x - i = PAT-MISMATCH(i-r,f) for some
f (1 ≤ f ≤ s).

Location x may satisfy either both conditions or any one of them or none.

We are ready now to present the case analysis for any location of the text
x, i+1 ≤ x ≤ j, and how it affects the question: t_x = a_{x-i}? (In words, does
location x of the text match location x-i of the pattern?)

Case 0. x does not satisfy Condition 1 and x also does not satisfy Condition 2.
Then t_x = a_{x-r} and a_{x-r} = a_{x-i}, so location x of the text must match
location x-i of the pattern (t_x = a_{x-i}) and we need not bother to compare
t_x and a_{x-i}. (A similar argument is used in the algorithm of [KMP].)

Case 1. x satisfies one of the two conditions and does not satisfy the other. Let
us justify why t_x ≠ a_{x-i} in any of these two possibilities.
If Condition 1 holds and Condition 2 does not hold then t_x ≠ a_{x-r} and
a_{x-r} = a_{x-i}. Therefore, t_x ≠ a_{x-i}. If Condition 1 does not hold and
Condition 2 holds then t_x = a_{x-r} and a_{x-r} ≠ a_{x-i}. Therefore, t_x ≠ a_{x-i}.
So, we know that there must be a mismatch at location x and again we dispense
with comparing t_x and a_{x-i}. However, we do need to increase the counter of
mismatches b by one and update TEXT-MISMATCH(i,b).

Case 2. x satisfies both conditions. Here we are unable to reason whether
t_x = a_{x-i} or not. So, we compare these two symbols. If they are different we
update b and TEXT-MISMATCH(i,b) as in Case 1.
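
The case analysis can be checked mechanically on small inputs. The sketch
below evaluates the two conditions by their underlying inequalities
(t_x ≠ a_{x-r} and a_{x-i} ≠ a_{x-r}) rather than by table lookups, which is
an assumption made to keep the example short. The function name
check_case_analysis is hypothetical, and the range of x is taken as
i+1, ..., r+m (capped at n) instead of i+1, ..., j.

```python
def check_case_analysis(text, pattern, r, i):
    """Count Cases 0, 1, 2 for locations x covered by copies laid at r and i.

    Asserts what the case analysis predicts: Case 0 forces t_x = a_{x-i},
    Case 1 forces t_x != a_{x-i}. Case 2 predicts nothing.
    """
    assert 0 <= r < i
    n, m = len(text), len(pattern)
    t, a = " " + text, " " + pattern          # 1-based access: t[1..n], a[1..m]
    counts = {0: 0, 1: 0, 2: 0}
    for x in range(i + 1, min(r + m, n) + 1):
        cond1 = t[x] != a[x - r]              # mismatch w.r.t. r
        cond2 = a[x - i] != a[x - r]          # mismatch between the two copies
        case = int(cond1) + int(cond2)
        counts[case] += 1
        if case == 0:
            assert t[x] == a[x - i]
        elif case == 1:
            assert t[x] != a[x - i]
    return counts


print(check_case_analysis("aabaab", "aab", 0, 1))
```

Only Case 2 locations require an actual symbol comparison, which is what
bounds the work of MERGE.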

Specifically, procedure MERGE operates as if it merges the increasing
sequence of ≤ k+1 locations

r + TEXT-MISMATCH(r,q), ..., r + TEXT-MISMATCH(r,k+1)

and the increasing sequence of ≤ 2k+1 locations

i + PAT-MISMATCH(i-r,1), ..., i + PAT-MISMATCH(i-r,s)

into one increasing sequence. However, instead of explicitly merging the two
sequences MERGE checks whether each location satisfies Case 1 or Case 2 and
treats the location according to the case analysis given above.

Procedure MERGE(i,r,j,b)

Input:  1) TEXT-MISMATCH[r; q,...,k+1]
        2) PAT-MISMATCH[i-r; 1,...,s]

Initialize: d := q; f := 1

(* The variable d will be used in the form TEXT-MISMATCH(r,d). Initially it is q
and then it is increased by one at a time. The variable f will be used in the form
PAT-MISMATCH(i-r,f). Initially it is 1 and then it is increased by one at a
time. *)

while not [Case a or Case b or Case c] do

(* We stop iterating the while loop, and return control to TEXT-ANALYSIS, in any
of the following cases:

Case a. b = k+1. This means that we have already found k+1 mismatches with
respect to i.

Case b. d = k+2. When d was assigned with k+1 then in the middle line we were
exactly over location j of the bottom line. A careful observation at the way in
which d is updated in procedure MERGE reveals that the fact that d was
increased to k+2 implies that in the middle line we must have also passed
location j of the bottom line, and therefore it is time to return control to
TEXT-ANALYSIS and continue the search for mismatches by procedure EXTEND.

Case c. [i + PAT-MISMATCH(i-r,f) > j and TEXT-MISMATCH(r,d) = m+1].
The first conjunct means that in the upper line of Fig. 1(d) we have already
passed location j of the bottom line. The second conjunct means that there
was an occurrence of the pattern at t_{r+1} with d-1 mismatches and in the
middle line of Fig. 1(d) we have also already passed the location j of the bottom
line. *)

begin
    if i + PAT-MISMATCH(i-r,f) > r + TEXT-MISMATCH(r,d)
    (* Case 1: Condition 1 is satisfied *)
    then
        b := b + 1;
        TEXT-MISMATCH(i,b) := TEXT-MISMATCH(r,d) - (i-r);
        d := d + 1;
    else
        if i + PAT-MISMATCH(i-r,f) < r + TEXT-MISMATCH(r,d)
        (* Case 1: Condition 2 is satisfied *)
        then
            b := b + 1;
            TEXT-MISMATCH(i,b) := PAT-MISMATCH(i-r,f);
            f := f + 1;
        else
            (* i + PAT-MISMATCH(i-r,f) = r + TEXT-MISMATCH(r,d) *)
            (* Case 2 *)
            if a_{PAT-MISMATCH(i-r,f)} ≠ t_{i+PAT-MISMATCH(i-r,f)}
            then
                b := b + 1;
                TEXT-MISMATCH(i,b) := PAT-MISMATCH(i-r,f);
            f := f + 1; d := d + 1
end


Correctness of procedure MERGE. Consider iteration i.

Claim. If there are ≥ k+1 mismatches in locations ≤ j then MERGE finds the first
k+1 of them. If there are < k+1 mismatches in locations ≤ j then MERGE finds all
of them.

Proof of claim. Condition 1 holds for ≤ k+1 locations, which are > i and ≤ j. Let
y be the number of locations in this range for which Condition 2 holds. We do not
know anything about y. Suppose PAT-MISMATCH(i-r,1), ...,
PAT-MISMATCH(i-r,y) had included all mismatches between two copies of
the pattern which are i-r apart. Then, by our case analysis, MERGE could have
found all mismatches in the range between i+1 and j. But
PAT-MISMATCH[i-r; 1,...,2k+1] contains no more than 2k+1 mismatches. We
have to show that we never need more than this for the Claim to hold. If
PAT-MISMATCH(i-r,2k+1) ≥ j-i then we have all mismatches between the two
patterns for which Condition 2 holds for locations ≤ j in the text and the claim
follows. The remaining case is when PAT-MISMATCH(i-r,2k+1) < j-i. This
gives 2k+1 locations, which are > i and < j, for which Condition 2 holds. Recall
that Condition 1 holds for ≤ k locations in this range (the (k+1)-st mismatch
w.r.t. r, if it exists, is at location j itself). Therefore, there are ≥ k+1
locations, which are > i and < j, for which Condition 2 holds and Condition 1 does
not hold. All these locations satisfy Case 1. Therefore, they suffice to establish
that there is no occurrence with ≤ k mismatches starting at t_{i+1} and the claim
follows.

Procedure EXTEND finds mismatches between t_{j+1}, ..., t_{i+m} and
a_{j-i+1}, ..., a_m, by comparing proper pairs of symbols from the pattern and
the text in the naive way. EXTEND stops once it finds the (k+1)-st mismatch. If
there is an occurrence of the pattern with at most k mismatches which starts at
t_{i+1} then EXTEND stops at t_{i+m} after it finishes verifying this fact.

Procedure EXTEND(i,j,b)
while (b < k+1) and (j-i < m) do
begin
    j := j + 1;
    if t_j ≠ a_{j-i}
        then b := b + 1; TEXT-MISMATCH(i,b) := j-i;
end
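
The three procedures can be put together in a short executable sketch. It
follows TEXT-ANALYSIS, MERGE and EXTEND as specified above, with two
assumptions that are not part of the paper: PAT-MISMATCH is filled by brute
force (a stand-in for the pattern analysis of the next section), and j and b
are returned by the functions instead of being updated in place. All Python
names are hypothetical.

```python
def pat_mismatch_table(a, m, width):
    """Brute-force stand-in for the pattern analysis: rows 1..m-1, entries 1..width."""
    table = [None] * m
    for i in range(1, m):
        row = [f for f in range(1, m - i + 1) if a[i + f] != a[f]][:width]
        table[i] = [None] + row + [m + 1] * (width - len(row))
    return table


def merge(t, a, pat, tm, i, r, j, k, m):
    """MERGE: mismatches of row i at text locations i+1..j, from rows r and i-r."""
    d = 1
    while d <= k + 1 and tm[r][d] <= i - r:   # d := q, skipping locations <= i
        d += 1
    f, b, prow = 1, 0, pat[i - r]
    while b < k + 1 and d < k + 2:            # stop on Case a or Case b
        p_loc, t_loc = i + prow[f], r + tm[r][d]
        if p_loc > j and tm[r][d] == m + 1:   # stop on Case c
            break
        if p_loc > t_loc:                     # Case 1: only Condition 1 holds
            b += 1
            tm[i][b] = t_loc - i
            d += 1
        elif p_loc < t_loc:                   # Case 1: only Condition 2 holds
            b += 1
            tm[i][b] = prow[f]
            f += 1
        else:                                 # Case 2: compare the two symbols
            if a[prow[f]] != t[p_loc]:
                b += 1
                tm[i][b] = prow[f]
            f += 1
            d += 1
    return b


def extend(t, a, tm, i, j, b, k, m):
    """EXTEND: naive scan from t_{j+1} until k+1 mismatches or t_{i+m}."""
    while b < k + 1 and j - i < m:
        j += 1
        if t[j] != a[j - i]:
            b += 1
            tm[i][b] = j - i
    return j, b


def text_analysis(text, pattern, k):
    """Return TEXT-MISMATCH as a list of rows, each with k+1 entries."""
    n, m = len(text), len(pattern)
    t, a = " " + text, " " + pattern          # 1-based symbols t[1..n], a[1..m]
    pat = pat_mismatch_table(a, m, 2 * k + 1)
    tm = [[m + 1] * (k + 2) for _ in range(n - m + 1)]   # index 0 unused
    r = j = 0
    for i in range(n - m + 1):
        b = 0
        if i < j:
            b = merge(t, a, pat, tm, i, r, j, k, m)
        if b < k + 1:
            r = i
            j, b = extend(t, a, tm, i, j, b, k, m)
    return [row[1:] for row in tm]


def occurrences(text, pattern, k):
    """1-based starts i+1 with TEXT-MISMATCH(i, k+1) = m+1."""
    m = len(pattern)
    return [i + 1 for i, row in enumerate(text_analysis(text, pattern, k))
            if row[k] == m + 1]


print(occurrences("abcabd", "abd", 1))
```

Checking Case a and Case b in the loop header before reading
PAT-MISMATCH(i-r,f) and TEXT-MISMATCH(r,d) keeps both indices inside their
rows.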

Complexity. The running time of TEXT-ANALYSIS is O(nk). For each
iteration i (0 ≤ i ≤ n-m) the operations in TEXT-ANALYSIS excluding MERGE and
EXTEND take O(1) time. MERGE treats entries of the form
PAT-MISMATCH[i-r; 1,...,2k+1] (whose number is 2k+1) and entries of the
form TEXT-MISMATCH[r; 1,...,k+1] (whose number is k+1). Each of the
operations of MERGE can be charged to one of these 3k+2 entries in such a way that
each entry is being charged by O(1) operations. Therefore, MERGE requires O(k)
time. The total number of operations performed by EXTEND throughout all the
iterations is O(n) since it scans each symbol of the text at most once. So, we get
in total O(n(1 + k + 1)) = O(nk).

III. ANALYSIS OF THE PATTERN

In this section we describe the pattern analysis, in which
PAT-MISMATCH[1,...,m-1; 1,...,2k+1] is computed.

Let [1,...,m-1] be the set of m-1 rows of PAT-MISMATCH. Assume, w.l.g.,
that m is some power of 2. The algorithm uses a partition of this set into log_2 m
sets as follows:

[1], [2,3], [4,5,6,7], [8,...,15], ..., [m/2,...,m-1].

The pattern analysis has log m stages:
Stage l, 1 ≤ l ≤ log m. Compute PAT-MISMATCH for the rows of set l. (Where
set l, 1 ≤ l ≤ log m, is [2^{l-1},...,2^l - 1].)
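
The partition of the rows into stages is easy to enumerate. The sketch below
assumes, as in the text, that m is a power of 2; the function name
row_partition is hypothetical.

```python
def row_partition(m):
    """Sets 1..log m of rows: set l is [2^(l-1), ..., 2^l - 1]."""
    if m < 2 or m & (m - 1):
        raise ValueError("m must be a power of 2, at least 2")
    stages = m.bit_length() - 1                # log_2 m for a power of 2
    return [list(range(2 ** (l - 1), 2 ** l)) for l in range(1, stages + 1)]


print(row_partition(8))
```

Together the sets cover every row 1, ..., m-1 exactly once, and set l has
2^{l-1} rows.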

We describe in more detail the last stage (stage log m) and discuss briefly
later how to extend the same technique for the earlier stages. Essentially, we
apply the text analysis algorithm of the previous section. In order to keep this
presentation short, we overview the similarities to the text analysis and
elaborate only on the differences.

The input to stage log m of the pattern analysis consists of the following:

a) The array a_1, ..., a_{m/2}, which plays the role of the pattern (in the text
analysis).

b) The array a_{m/2+1}, ..., a_m, which plays the role of the text.

c) The two dimensional array PAT-MISMATCH[1,...,m/2-1; 1,...,4k+1],
which is the output of the previous log m - 1 stages of the pattern analysis.

The output of stage log m is PAT-MISMATCH[m/2,...,m-1; 1,...,2k+1].

Below, we give a very high-level specification of stage log m of the pattern
analysis.

Initialize: PAT-MISMATCH[m/2,...,m-1; 1,...,2k+1] := m+1;
            r := m/2; j := m/2;
for i := m/2 to m-1 do
begin
    b := 0;
    if i < j
        then MERGE(i,r,j,b);
    if b < 2k+1
        then r := i; EXTEND(i,j,b)
end

One important difference with respect to the text analysis needs to be
emphasized:

In TEXT-ANALYSIS we were after the k+1 first mismatches for each location
of the text, while here we want to find the 2k+1 first mismatches. The
correctness proof of iteration i of procedure MERGE, in the previous
section, needed the first 2k+1 locations for which Condition 2 holds in order
to find these first k+1 mismatches. A careful check of the proof will show
that the first 4k+1 locations where Condition 2 holds would have
sufficed for finding the first 2k+1 mismatches, as required here. This
explains item c) in the input for stage log m.

Next, we describe briefly stage l (1 ≤ l < log m), by emphasizing the
differences with respect to stage log m which was described above.

The input to stage l of the pattern analysis consists of the following:

a) The array a_1, ..., a_{m-2^{l-1}}, which plays the role of the pattern (in the
text analysis).

b) The array a_{2^{l-1}+1}, ..., a_m, which plays the role of the text.

c) The two dimensional array
PAT-MISMATCH[1,...,2^{l-1}-1; 1,...,min(2^{log m - l} 4k + 1, m - 2^{l-1})], which is
the output of the previous l-1 stages of the pattern analysis.

The output of iteration i at stage l (2^{l-1} ≤ i ≤ 2^l - 1) is
PAT-MISMATCH[i; 1,...,min(2^{log m - l} 2k + 1, m - i)].

We note three differences in this stage, with respect to stage log m:

a) At stage l the for loop is: for i := 2^{l-1} to 2^l - 1. At each iteration i, we look
for the mismatches between a_{i+1}, ..., a_m and a_1, ..., a_{m-i}
(2^{l-1} ≤ i ≤ 2^l - 1).

b) Iteration i of stage l looks for min(2k 2^{log m - l} + 1, m - i) mismatches.

c) The output of stages 1, ..., l-1 must give the first
min(4k 2^{log m - l} + 1, m - 2^{l-1}) mismatches.

Complexity. For each iteration i at stage l (2^{l-1} ≤ i ≤ 2^l - 1),
(1 ≤ l ≤ log m) the operations in the "main program" excluding MERGE and
EXTEND take O(1) time. As in the previous section MERGE requires O("number of
mismatches we look for") time. Here it means O(2k 2^{log m - l}) time. The total
number of operations performed by EXTEND throughout all iterations of stage l is
O(m). Stage l has 2^{l-1} iterations, therefore it takes
O(m + 2^{l-1} (2k 2^{log m - l})) = O(km) time. We have log m stages. So, the running
time of the pattern analysis is

    O( sum_{l=1}^{log m} km ) = O(km log m).
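
The counting behind this bound can be tabulated for small values. The sketch
below sums, over all stages, the number of iterations times the number of
mismatches each iteration looks for, ignoring the cap m - i (an assumption
that can only overestimate). The function names merge_budget and
total_merge_bound are hypothetical.

```python
def merge_budget(m, k, l, i):
    """Mismatches iteration i of stage l looks for: min(2k*2^(log m - l) + 1, m - i)."""
    log_m = m.bit_length() - 1                 # m is assumed to be a power of 2
    return min(2 * k * 2 ** (log_m - l) + 1, m - i)


def total_merge_bound(m, k):
    """Sum over stages l of 2^(l-1) * (2k*2^(log m - l) + 1), without the cap."""
    log_m = m.bit_length() - 1
    return sum(2 ** (l - 1) * (2 * k * 2 ** (log_m - l) + 1)
               for l in range(1, log_m + 1))


print(total_merge_bound(8, 2))
```

Each stage contributes km plus its number of iterations, so the uncapped sum
equals km log m + (m - 1), in agreement with the O(km log m) bound.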
Figure 1(a). At iteration i, TEXT-ANALYSIS checks whether there are > k
mismatches between the following strings: a_1, ..., a_m and t_{i+1}, ..., t_{i+m}.

Figure 1(b). TEXT-MISMATCH[r; q,...,k+1] gives all the mismatches between the
following strings: a_{i-r+1}, ..., a_{j-r} (of the pattern laid at t_{r+1}) and
t_{i+1}, ..., t_j.

Figure 1(c). PAT-MISMATCH[i-r; 1,...,s] gives all (≤ 2k+1) the mismatches
between the following strings: a_1, ..., a_{j-i} and a_{i-r+1}, ..., a_{j-r}.

Figure 1(d). MERGE uses the information in Fig. 1(b) and 1(c) to compute
TEXT-MISMATCH[i; 1,...,k+1]. If MERGE is unable to complete this job then
EXTEND completes it. (Upper line: a_1, ..., a_{j-i}, a_{j-i+1}, ..., a_m. Middle
line: a_{i-r+1}, ..., a_{j-r}. Bottom line: t_{i+1}, ..., t_j, t_{j+1}, ... . MERGE
covers the locations up to j, EXTEND the locations after j, and TEXT-ANALYSIS
the whole range.)